METHOD AND APPARATUS FOR EXTRACTING DATA FROM DATA SOURCES
ON A NETWORK

TECHNICAL FIELD

The present invention is directed to a method and apparatus
for extracting data from data sources on a network and,
more particularly, to a method and apparatus for learning
general data extraction heuristics from known data extraction
programs for respective data sources to obtain a general
data extraction procedure.

BACKGROUND OF THE INVENTION

Computer networks are widely used to facilitate the
exchange of information. A network may be a local area
network (LAN), a wide-area network (WAN), a corporate
Intranet, or the Internet.

The Internet is a series of inter-connected networks. Users
connected to the Internet have access to the vast amount of
information found on these networks. Online servers and
Internet providers allow users to search the World Wide Web
(Web), a globally connected network on the Internet, using
software programs known as search engines. The Web is a
collection of Hypertext Mark-Up Language (HTML) documents
on computers that are distributed over the Internet.
The collection of Web pages represents one of the largest
databases in the world. However, accessing information on
individual Web pages is difficult because Web pages are not
a structured source of data. There is no standard organization
of information provided by a Web page, as there is in
traditional databases.

Attempts have been made to address the problem of
accessing data from Web pages. For example, information
integration systems have been developed to allow a user to
query structured information that has been extracted from
the Web and stored in a knowledge base. In such systems,
information is extracted from Web pages using special-purpose
programs or "wrappers". These special-purpose
programs convert Web pages into an appropriate format for
the knowledge base. In order to extract data from a particular
Web page, a user must write a wrapper, which is specific to
the format of that Web page. Therefore, a different wrapper
must be written for the format of each Web page that is
accessed. Because data can be presented in many different
formats, and because Web pages frequently change, building
and maintaining wrappers and information integration systems
is time-consuming and tedious.

A number of proposals have been made for reducing the
cost of building wrappers. Data exchange standards such as
the extensible Markup Language (XML) have promise, but
such standards are not yet widely used. In addition, Web
information sources using legacy formats, like HTML, will
be common for some time, and therefore, extraction methods
must be able to extract information from these legacy
formats. Special languages for writing wrappers and
semi-automated tools for wrapper construction have been
proposed, as well as systems that allow wrappers to be
trained from examples. However, none of these proposals
eliminate the human effort involved in creating a wrapper for
a Web page. Moreover, the training methods are directed to
learning a wrapper for Web pages with a single, specific
format. Consequently, a new training process is required for
each Web page format.

More particularly, when a learning system is used, for
example, it is necessary for a person to label the samples
given to the learning algorithm. More particularly, a user
must label the first few items that should be extracted from
the particular Web page starting from the top of the page.
These are assumed to be a complete list of items to be
extracted up to this point. That is, it is assumed that any
unmarked text preceding the last marked item should not be
extracted. The learning system then learns a wrapper from
these examples, and uses it to extract data from the remainder
of the page. The learned wrapper can be used for
other Web pages with the same format as the page used in
training. Therefore, in the learning system, human input is
required to determine the page-specific wrapper.

These problems are not limited to retrieving data from
HTML documents. These problems exist for documents
on any network.

Therefore, a general, page-independent data extraction
procedure was needed to enable a user to easily and accurately
extract data from data sources having many different
formats. Additionally, an improved format-specific data
extraction procedure was needed to accurately extract data
from data sources. A procedure was also needed for determining
the possible data extraction procedures
available for accurately extracting data from a data source.
The present invention was developed to accomplish these
objectives.

SUMMARY OF THE INVENTION

In view of the foregoing, it is a principal object of the
present invention to provide a method and apparatus which
eliminates the deficiencies of the prior art.

It is a further object of the present invention to provide a
method and apparatus for learning general data extraction
heuristics to generate a general data extraction procedure to
enable a user to extract data from a data source on a network,
regardless of the format of the data source.

It is another object of the present invention to provide a
method and apparatus for learning a general data extraction
procedure and for using this procedure to learn a
format-specific wrapper.

It is yet a further object of the present invention to provide
a method and apparatus for generating a ranked list of
wrappers available for accurately extracting data for a
particular data source on a network.

These and other objects are achieved by the present
invention, which according to one aspect, provides a method
and apparatus for learning a general data extraction procedure
from a set of working wrappers and the data sources
they correctly wrap. New data sources that are correctly
wrapped by the learned procedure can be incorporated into
a knowledge base.

According to another aspect of the present invention, a
method and apparatus are provided for using the learned
general data extraction heuristics for the general procedure
to learn specific data extraction procedures for data sources,
respectively.

According to yet another aspect of the present invention,
a list of possible wrappers for a data source is generated,
where the wrappers in the list are ranked according to
performance level.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

FIG. 1 is a block diagram of a system according to the
present invention;

FIG. 2 is a block diagram of one of the user stations
illustrated in FIG. 1;

FIG. 3 represents the HTML source code for a Web page
from which data is to be extracted; 

FIG. 4 illustrates the data extracted from the Web page 
shown in FIG. 3; 

FIG. 5 illustrates the HTML parse tree for the Web page 
shown in FIG. 3; 

FIG. 6 illustrates a classification tree obtained by
re-labeling the nodes of the HTML parse tree shown in
FIG. 5;

FIG. 7 illustrates the flow of data for generating a general 
data extraction procedure; 

FIG. 8 illustrates an example of the rules obtained by the
learning system associated with the general data extraction 
procedure; 

FIG. 9 is a flow diagram illustrating the steps required to 
generate the general data extraction procedure according to 
the present invention; 

FIG. 10 is a flow diagram illustrating the steps required to 
implement the format-specific data extraction procedure 
according to the present invention; 

FIG. 11 illustrates information on a sample Web page; 

FIG. 12 illustrates the HTML source code for a Web page 
including a simple list and the data extracted from this page 
according to the present invention; and 

FIG. 13 illustrates programs for recognizing structures in 
an HTML page according to the present invention. 

DETAILED DESCRIPTION 

In the description that follows, the present invention will 
be described in reference to preferred embodiments that 
operate on the Web. In particular, examples will be described 
which illustrate particular applications of the invention on
the Web. The present invention, however, is not limited to 
any particular information source nor limited by the 
examples described herein. Therefore, the description of the
embodiments that follows is for purposes of illustration and
not limitation.

Referring to FIG. 1, users are connected to a network 10 
via user stations 12. The user stations 12 may be, for
example, personal computers, workstations, minicomputers,
mainframes, or a Web Server. The network 10 in
the present invention may be any network such as a LAN, 
a wide-area network (WAN), a corporate Intranet, or the 
Internet, for example. Each of the user stations 12 usually 
includes a central processing unit (CPU) 14, a memory 16, 
a hard drive 18, a floppy drive 20, at least one input device 
22 and at least one output device 24, connected by a data bus 
26, as shown in FIG. 2. The input device 22 may be a 
keyboard or a mouse, while the output device 24 may be a 
display or a printer, for example. 

Learning Page-Independent Heuristics for 
Extracting Data from Web Pages 

Users connected to the Internet can access Web pages 
containing information on just about any imaginable subject. 
Web pages are represented as HTML documents. An 
example of the HTML source code for a sample Web page 
from which data is to be extracted is shown in FIG. 3. A 
wrapper, which is known to correctly wrap the page shown 
in FIG. 3, is used to extract the data shown in FIG. 4. In 
order to extract the data, the wrapper manipulates the HTML 
parse tree for the Web page. The HTML parse tree is a tree 
with nodes labeled by tag names such as body, table, and ul. 
The wrapper manipulates the HTML parse tree primarily by
deleting and re-labeling the nodes of the parse tree. In other
words, the wrapper converts the HTML parse tree for the
particular Web page into another tree labeled with terms
compatible with a knowledge base, so that the extracted data
can be stored directly into the knowledge base.

In order to automatically learn a wrapper, it is necessary 
to use a learning system with a learning algorithm. Learning 
systems usually learn to classify information. More 
particularly, learning systems learn to associate a class label
from a small, fixed set of labels with an unlabeled instance.
Therefore, to implement a classification learner on data
extraction problems, it is necessary to re-cast the extraction
problem as a labeling problem. Since each data item
extracted from the Web page by the wrapper corresponds to
a node in the HTML parse tree, the output of the wrapper can
be encoded by appropriately labeling parse tree nodes.
Therefore, each node in the HTML parse tree is an unlabeled
instance which can be encoded by properly labeling the
node.

The HTML parse tree for the Web page shown in FIG. 3 
is illustrated in FIG. 5. The action of a wrapper can be
encoded by labeling the nodes of the HTML parse tree as
"positive" or "negative", where a node is labeled as "positive"
if the text of the node is extracted by the wrapper, and
labeled as "negative" otherwise. In the HTML parse tree in
FIG. 5, every <li> node would be labeled "positive", and all
other nodes would be labeled "negative". This is shown in
FIG. 6. Extracting the text of each <li> node yields the
database entries shown in FIG. 4.
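
The labeling described above can be illustrated with a minimal sketch. It is not the implementation of the invention: the Node and TreeBuilder classes, the sample PAGE, and the stand-in wrapper (a function that accepts every <li> node) are all hypothetical names introduced here, and the page is assumed to be well-formed HTML with explicit end tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class Node:
    """One parse tree node; text is kept in child nodes tagged '#text'."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        self.label = None


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree; assumes explicit end tags (an assumption)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(Node("#text", self.current, data))


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def text_of(node):
    """All text that falls below a node, in document order."""
    if node.tag == "#text":
        return node.text
    return "".join(text_of(c) for c in node.children)


def label_tree(root, wrapper):
    """Encode a wrapper's action as 'positive'/'negative' node labels."""
    for n in walk(root):
        if not n.tag.startswith("#"):
            n.label = "positive" if wrapper(n) else "negative"


PAGE = "<html><body><h1>Staff</h1><ul><li>Ann</li><li>Bob</li></ul></body></html>"
builder = TreeBuilder()
builder.feed(PAGE)
builder.close()
root = builder.root

label_tree(root, lambda n: n.tag == "li")  # hypothetical wrapper: extracts every <li>
extracted = [text_of(n) for n in walk(root) if n.label == "positive"]
```

Here extracted holds the text of the two <li> nodes, and every other element node carries the label "negative".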

Knowledge base entries are extracted from the Web page 
by the wrapper. The labeling of the nodes in the HTML parse
tree indicates which nodes contribute text to the knowledge 
base entries. Nodes labeled "positive" do contribute text, 
and nodes labeled "negative" do not contribute text to the 
knowledge base. 

In sum, the task of extracting data from a Web page can 
be recast as the task of labeling each node in the HTML 
parse tree for the page. A wrapper can be represented as a 
procedure for labeling HTML parse tree nodes. Such a
node-labeling procedure can be learned from a sample of
correctly labeled parse tree nodes. A set of correctly labeled
parse tree nodes, in turn, can be generated given an existing
wrapper and a page that is correctly wrapped.

The foregoing principles can be applied to learn general,
page-independent heuristics to obtain a general data extraction
procedure. The general data extraction procedure can be
used to extract data from Web pages regardless of Web page
format.

According to the present invention, general,
page-independent heuristics for extracting data from Web pages
are learned from a data set including data extracted from
Web pages that have been correctly wrapped. More
particularly, the input to the learning system according to the
present invention is a set of working wrappers paired with the
corresponding HTML Web pages they correctly wrap. The
data extracted from the sample Web pages is stored in a
database or information integration system. The database
may be any form of database suitable for storing data
extracted from the Web, such as the database described in
U.S. Pat. No. 6,295,533, entitled System and Method for
Accessing Heterogeneous Databases and issued on Sep. 25,
2001, called WHIRL, which is incorporated by reference
herein. The data is then processed by a learning algorithm,
such as the algorithm for Repeated Incremental Pruning to
Produce Error Reduction (RIPPER) disclosed in U.S. Pat.
No. 5,719,692, entitled Rule Induction on Large Noisy Data
Sets and issued Feb. 17, 1998, which is incorporated by
reference herein. Any suitable learning algorithm may be
used. The learning algorithm generates a procedure for
extracting data from Web pages regardless of format.

More particularly, a set of wrappers w_1, . . . , w_N that
correctly wrap the Web pages p_1, . . . , p_N are used to learn
a general, page-independent data extraction procedure.
Referring to FIG. 7, for each pair w_i, p_i, find the parse tree for
the page p_i, and label the nodes in that tree according to the
wrapper w_i. This results in a set of labeled parse tree
nodes <n_{i,1}, l_{i,1}>, . . . , <n_{i,m_i}, l_{i,m_i}>, which are added to a
data set S. The data set S is used to train a classification
learner, such as RIPPER. As with any learning procedure,
the larger the data set, the better the performance. In
experiments, 84 wrapper/Web page pairs were used to obtain
the data set. The output of the learner is a general
node-labeling procedure h, which is a function mapping parse tree
nodes to a set of labels, for example {positive, negative}:

h: parse-tree-node → {positive, negative}.

The learned function h can then be used to label the parse 
tree nodes of new Web pages, and thereby extract data from 
these pages. 
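
The construction of the data set S and the use of the learned function h might be sketched as follows. This is only an illustration under stated assumptions: nodes are reduced to small dictionaries, the feature vector is just (tag, parent tag), and the learner is a simple majority-vote table standing in for RIPPER, which is not reproduced here. All names (build_data_set, train, page1, and so on) are hypothetical.

```python
from collections import Counter, defaultdict


def features(node):
    """Hypothetical, deliberately tiny feature vector: (tag, parent tag)."""
    return (node["tag"], node["parent"])


def build_data_set(pairs):
    """Label every node of every page with its known wrapper; collect <features, label>."""
    S = []
    for wrapper, nodes in pairs:
        for n in nodes:
            label = "positive" if wrapper(n) else "negative"
            S.append((features(n), label))
    return S


def train(S):
    """Stand-in learner (NOT RIPPER): majority label per feature vector."""
    votes = defaultdict(Counter)
    for feats, label in S:
        votes[feats][label] += 1
    table = {f: c.most_common(1)[0][0] for f, c in votes.items()}

    def h(node):
        # Unseen feature vectors default to "negative".
        return table.get(features(node), "negative")

    return h


# Two wrapper/page pairs; each page is a flat list of parse tree nodes.
page1 = [{"tag": "h1", "parent": "body"}, {"tag": "ul", "parent": "body"},
         {"tag": "li", "parent": "ul"}, {"tag": "li", "parent": "ul"}]
page2 = [{"tag": "table", "parent": "body"}, {"tag": "tr", "parent": "table"},
         {"tag": "td", "parent": "tr"}]
pairs = [(lambda n: n["tag"] == "li", page1),
         (lambda n: n["tag"] == "td", page2)]

S = build_data_set(pairs)
h = train(S)

# Apply h to the nodes of a new, previously unseen page.
new_page = [{"tag": "h2", "parent": "body"}, {"tag": "li", "parent": "ul"}]
new_labels = [h(n) for n in new_page]
```

The point of the sketch is the data flow: labels come from known wrappers during training, and from h alone on new pages.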

In experiments, the classification learning system RIPPER
was used. RIPPER, like most classification learners,
learns to associate a class label with an unlabeled instance; 
that is, it requires an example to be represented as a vector 
of relatively simple features. More particularly, in order to 
label HTML parse tree nodes as "positive" or "negative", 
they are encoded as learning "instances". A learning instance 
is a set of relevant features that the parse tree node has. One 
keeps track of which learning instance corresponds to which 
node in the tree. The label assigned to the learning instance 
is considered to be assigned to the parse tree node. Hence, 
labeling nodes and labeling learning instances may be
discussed interchangeably. The learning algorithm takes 
labeled instances (instances where the label is given by the 
known wrapper) and builds a classifier (general data extrac- 
tion procedure). The classifier consists of a set of rules that 
choose a label ("positive" or "negative") based on the 
features of the instance (e.g., features of the HTML parse 
tree node). The classifier can then be used to label new 
instances. 

Features which are plausibly related to the classification
task may be used in the learning system. The value of each
feature is either a real number or a symbolic value, such
as an atomic symbol from a designated set like {positive,
negative}. The primitive tests for a real-valued feature are
of the form f ≥ θ or f ≤ θ, where f is a feature and θ is a
constant number, and the primitive tests for a symbolic
feature are of the form f = a_i, where f is a feature and a_i is
a possible value for f. RIPPER also allows set-valued
features. The value of a set-valued feature is a set of atomic
symbols, and tests on set-valued features are of the form
a_i ∈ f, where f is the name of a feature and a_i is a possible
value (e.g., ul ∈ ancestorTagNames). For two-class problems
of this sort, RIPPER uses a number of heuristics to build a
disjunctive normal form formula that generalizes the positive
examples. This formula is usually a set of rules, each
rule having the form "label an instance 'positive' if t_1 and t_2
and . . .", where each t_i in the rule is a primitive test on
some feature.
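
A rule set of this form can be evaluated with a few lines of code. The sketch below shows only the evaluation of a disjunctive normal form formula over the three kinds of primitive tests; it does not show how RIPPER learns the rules. The two rules in RULES and the feature names are hypothetical illustrations, not the rules of FIG. 8.

```python
def ge(f, theta):
    """Real-valued test f >= theta."""
    return lambda x: x[f] >= theta


def le(f, theta):
    """Real-valued test f <= theta."""
    return lambda x: x[f] <= theta


def eq(f, a):
    """Symbolic test f = a."""
    return lambda x: x[f] == a


def contains(f, a):
    """Set-valued test a in f."""
    return lambda x: a in x[f]


# Each rule is a conjunction of primitive tests; the rule set is their disjunction.
RULES = [
    [eq("tagName", "li"), contains("ancestorTagNames", "ul")],
    [eq("tagName", "a"), ge("textLength", 10)],
]


def classify(instance, rules=RULES):
    """Label 'positive' if any rule has all of its tests satisfied."""
    if any(all(test(instance) for test in rule) for rule in rules):
        return "positive"
    return "negative"


list_item = {"tagName": "li", "ancestorTagNames": {"html", "body", "ul"}, "textLength": 3}
short_link = {"tagName": "a", "ancestorTagNames": {"html", "body", "p"}, "textLength": 4}
long_link = {"tagName": "a", "ancestorTagNames": {"html", "body", "p"}, "textLength": 25}

results = [classify(list_item), classify(short_link), classify(long_link)]
```

Because an instance that satisfies no rule falls through to "negative", the rule set only has to describe the positive class.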

In experiments, the following series of features were used 
to describe a parse tree node. The tag name feature is the 
HTML tag name associated with the node, such as a, p, br, 
and html. This is an informative feature because some tags 
such as "head" are always negative, while other tags such as 
the anchor tag a are often positive. The size of the string s_i
directly associated with a node was measured in two ways:
(1) the text length, which is the total size of all of the text
associated with a node, and (2) the non-white text length,
which is similar to the text length, but ignores blanks, tabs,
and line returns. The length of the text contained in the
sub-tree rooted at the current node was measured by the
features of recursive text length and recursive, non-white
text length. The string is directly associated with the node if
it is contained in the HTML element associated with the
node, and not contained in any smaller HTML element.
These features are important because they measure the size
of the string s_i that would be extracted if the node were
marked as positive. Other natural and easily computed
features include set of ancestor tag names, depth, number of
children, number of siblings, parent tag name, set of child
tag names, and set of descendant tag names. Since the size
of parse trees varies, many of the above-identified features can
be normalized by the total number of nodes or by the
maximal node degree.
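
As a rough illustration, the features listed above could be computed from a parse tree as follows. The Node class, the feature key names, and the small example tree are assumptions made for this sketch; normalization is omitted, and depth is counted as the number of ancestors.

```python
class Node:
    """Parse tree node; `text` is the string directly associated with the node."""

    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for c in self.children:
            c.parent = self


def non_white(s):
    """Length of s ignoring blanks, tabs, and line returns."""
    return len("".join(s.split()))


def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)


def ancestors(n):
    while n.parent is not None:
        n = n.parent
        yield n


def features(n):
    subtree = [n, *descendants(n)]
    anc = list(ancestors(n))
    return {
        "tagName": n.tag,
        "textLength": len(n.text),
        "nonWhiteTextLength": non_white(n.text),
        "recursiveTextLength": sum(len(d.text) for d in subtree),
        "recursiveNonWhiteTextLength": sum(non_white(d.text) for d in subtree),
        "ancestorTagNames": {a.tag for a in anc},
        "depth": len(anc),
        "numChildren": len(n.children),
        "numSiblings": len(n.parent.children) - 1 if n.parent else 0,
        "parentTagName": n.parent.tag if n.parent else None,
        "childTagNames": {c.tag for c in n.children},
        "descendantTagNames": {d.tag for d in descendants(n)},
    }


li1, li2 = Node("li", "Ann Lee"), Node("li", "Bob")
ul = Node("ul", "", [li1, li2])
body = Node("body", "", [Node("h1", "Staff"), ul])
html = Node("html", "", [body])

f_li = features(li1)
f_ul = features(ul)
```

Set-valued features are kept as Python sets so that membership tests such as "ul" in f_li["ancestorTagNames"] map directly onto the set-valued tests described earlier.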

The following features are intended to detect and quantify
repeated structure in the parse tree. The repetitive aspect of
a structure can often be detected by looking at the sequence
of node tags that appear in the paths through the tree. To
measure this repetition, let tag(n) denote the tag associated
with a node n, and define the tag sequence position of n,
p(n), as the sequence of tags encountered in traversing the
path from the root of the parse tree to n, such as p(n)=<html,
. . . , tag(n)>. If p(n1)=<t_1, . . . , t_k>, and p(n2)=<t_1, . . . , t_k,
t_{k+1}, . . . , t_m>, then it is determined that the tag sequence
position p(n1) is a prefix of p(n2). If p(n1) is strictly shorter
than p(n2), then it is determined that the tag sequence
position p(n1) is a proper prefix of p(n2).

The feature of the node prefix count for n is used as a way
of measuring the degree to which n participates in a repeated
substructure. The node prefix count for n, p_count(n), is the
number of leaf nodes l in the parse tree such that the tag
sequence position of n is a prefix of the tag sequence
position of l. More formally,
p_count(n) = |{l : p(n) is a tag sequence prefix of p(l), l
is a leaf}|. The node suffix count for n, s_count(n), is closely
related. The feature of the node suffix count is defined as the
number of leaf nodes l with tag sequence positions of which
p(n) is a proper prefix. Both p_count(n) and s_count(n) can be
normalized by the total number of paths in the parse tree.
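
The two counts can be made concrete with a short sketch that follows the definitions in the text literally. The Node class, the function names, and the example tree are hypothetical, and normalization by the number of paths is left out.

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.parent = tag, None
        self.children = list(children)
        for c in self.children:
            c.parent = self


def tag_sequence(n):
    """p(n): tags on the path from the root of the parse tree down to n."""
    seq = []
    while n is not None:
        seq.append(n.tag)
        n = n.parent
    return tuple(reversed(seq))


def leaves(root):
    if not root.children:
        yield root
    for c in root.children:
        yield from leaves(c)


def is_prefix(p, q):
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_count(n, root):
    """Number of leaves l such that p(n) is a prefix of p(l)."""
    p = tag_sequence(n)
    return sum(is_prefix(p, tag_sequence(l)) for l in leaves(root))


def suffix_count(n, root):
    """Number of leaves l such that p(n) is a proper prefix of p(l)."""
    p = tag_sequence(n)
    return sum(
        is_prefix(p, q) and len(p) < len(q)
        for q in (tag_sequence(l) for l in leaves(root))
    )


items = [Node("li"), Node("li"), Node("li")]
ul = Node("ul", items)
h1 = Node("h1")
body = Node("body", [h1, ul])
html = Node("html", [body])

counts = {
    "li": (prefix_count(items[0], html), suffix_count(items[0], html)),
    "ul": (prefix_count(ul, html), suffix_count(ul, html)),
    "h1": (prefix_count(h1, html), suffix_count(h1, html)),
    "body": (prefix_count(body, html), suffix_count(body, html)),
}
```

In this example each <li> leaf has a prefix count of 3 because all three list items share the same tag sequence position, while the lone <h1> has a prefix count of 1. That difference is the repetition signal these features are meant to capture.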

FIG. 8 illustrates some representative rules that appeared
in the hypothesis obtained by training RIPPER on all of the
84 wrapper/page sample pairs using the features noted
above.

The features discussed herein are illustrative of the types
of features that may be used. They are not intended to
represent necessary features or the only features to be used.

The steps performed in determining the learned function
h are shown in FIG. 9. Preprocessing is performed in step
S1 to obtain a Web page in step S2. The wrapper known to
wrap the Web page is run in step S3 to obtain database
entries in step S4. The nodes in the HTML parse tree are
labeled in step S5 to generate the labeled tree in step S6. The
features for each node are extracted in step S7 to label the
unlabeled instances in the tree in step S8. Steps S1-S8 are
repeated for a plurality of Web pages which form the
sampling for the learning algorithm. The data output from
the repeated steps are supplied to the learning algorithm in
step S9, which outputs the learned function h.

According to the present invention, page formats that are
correctly wrapped by the learned heuristics can be incorporated
into a knowledge base with minimal human effort. It
is only necessary to indicate where in the knowledge base 
the extracted data should be stored. In contrast, typical
wrapper-induction methods require a human teacher to train
them on each new page format.

The learned format-independent data extraction heuristics
of the general data extraction procedure substantially
improve the performance of methods for learning
page-specific wrappers as well. More particularly, the method of
generating the general data extraction procedure can be used
to learn page-specific or format-specific wrappers.

Page-specific wrapper induction is performed by training
a new wrapper for each new page format using examples
specific to the particular Web page. Typically, the
wrapper-induction system is trained as follows. The user labels the
first few items that should be extracted from the page
starting from the top of the page. The learning system then
learns a wrapper from these examples, and it uses the
learned wrapper to extract data from the remainder of the
page. The learned wrapper can also be used for other pages
with the same or similar format as the page used in training.

Page-specific wrappers can also be trained using the
approach set forth above with respect to the general data
extraction procedure, the only difference being the way that
a data set is constructed, and the circumstances in which the
learned wrapper is used. More particularly, in learning a
page-specific wrapper, all training examples come from the
page being wrapped, and the learned classifier is only used
to label parse tree nodes from the particular page (or other
pages with the same format as the particular page).

According to the present invention, a hybrid approach
may be used to obtain a page-specific wrapper. According to
the hybrid approach, page-independent heuristics are
learned from a plurality of pages other than the page to be
wrapped. Then the learned page-independent heuristics are
used to attempt to wrap the page. If the user accepts this
wrapper, or if the performance goal is reached in simulation,
the learning is terminated. Otherwise, the user is repeatedly
prompted for the next positive example, as in intra-page
learning, until the learned wrapper meets the performance
goal.

The steps performed in the hybrid method are shown in
FIG. 10. In step S10, labeled instances are built according to
the steps set forth in FIG. 9. The labeled instances from step
S12 are supplied to a learning algorithm in step S14. The
learning algorithm generates a classifier or page-independent
data extraction procedure in step S16. Then the
general data extraction procedure is run to wrap the
page. Preprocessing is performed in step S20 to obtain the
page in step S22. Then, features are extracted from the page
in step S24 and unlabeled instances obtained in step S26 are
supplied to the data extraction procedure or classifier in step
S28. The data extraction procedure run in step S28 outputs
labeled instances of the HTML parse tree in step S30. In step
S32, it is determined whether the performance of the general
data extraction procedure is acceptable. If it is acceptable,
then the operation is terminated. If the performance is not
acceptable, the process returns to step S12 via step S18
where the user labels examples to provide more information
to the learning algorithm.

Recognizing Structure in Web Pages Using
Similarity Queries

Another embodiment of the invention is directed to generating
a list of proposed wrappers for wrapping a Web page.
The wrappers in the list can be ranked according to performance.
The selection of the particular wrapper to use to
extract data from the Web page can either be performed by
the user or automatically.

According to this embodiment, the information describing
each of the nodes of the HTML parse tree for the Web page
is stored in a database such as the database described in Ser.
No. 09/028,471. The information may include an identifier
for the node, the tag associated with the node, the text
associated with the node, and the position of the node within
the parse tree. The determination and database storage of the
information for each node is well-known in the art, and
therefore, is not described in detail herein. The text associated
with each node is processed to determine whether it is
similar to some known text or objects of the type which are
to be extracted from the Web page. The text similarity
processing is performed using the well-known Vector-Space
method for processing text. Then, a score or measure of
similarity is determined for each position in the parse tree
[...] the particular
position. The positions within the parse tree are then ranked
according to score or performance to generate a ranked list
of positions or wrappers. In this embodiment, either all of
the positions or only a portion of the positions are processed
and ranked. A detailed description of this embodiment of the
invention is set forth below.

Referring to FIG. 11, to a human reader, this text is
perceived as containing a list of three items, each containing
the name of a university department, with the
university name underlined. This apparently meaningful
structure is recognized without previous knowledge or
training, even though the text is ungrammatical nonsense
and the university names are imaginary. This suggests that
people employ general-purpose, page-independent strategies
for recognizing structure in documents. Incorporating
such strategies into a system that automatically (or
semi-automatically) constructs wrappers would be invaluable.

Described herein are effective structure recognition methods
for certain restricted types of list structures that can be
encoded compactly and naturally, given appropriate tools.
Each of these methods can be implemented in the WHIRL
database program described in U.S. Pat. No. 6,295,533,
issued on Sep. 25, 2001, noted above. However, it is
to be understood that the invention is not limited to the
use of WHIRL, and can be implemented with any suitable
database. The WHIRL program is a "soft" logic that includes
both "soft" universal quantification, and a notion of textual
similarity developed in the information retrieval (IR)
community. The structure-recognition methods set forth herein
are based on natural heuristics, such as detecting repetition
of sequences of markup commands, and detecting repeated
patterns of "familiar-looking" strings.

The methods can be used in a page-independent manner;
that is, given an HTML page, but no additional information,
the methods produce a ranked list of proposed "structures"
found in the page. By providing different types of information
about a page, the same methods can also be used for
page-specific wrapper learning or for updating a wrapper
after the format of a wrapped page has changed.

The structure-recognition problem will be discussed first.
The structure of a Web page is rated as meaningful or not
meaningful. The structure in a Web page would be rated as
meaningful if it contains structured information that could
plausibly be extracted from the page. In experiments, pages
that were actually wrapped were used, and a structure was
considered meaningful if it corresponded to information
actually extracted by an existing, hand-coded wrapper for
that page.

In the following description, two narrow
classes of wrappers are discussed. However, it is to be
understood that the discussion is for illustrative purposes
and the invention is not limited to the two narrow classes of
wrappers. The wrapper classes are simple lists and simple
hotlists. In a page containing a simple list, the information
extracted is a one-column relation containing a set of strings
s_1, . . . , s_N, and each s_i is all the text that falls below some
node n_i in the HTML parse tree for the page. In a simple
hotlist, the extracted information is a two-column relation,
containing a set of pairs (s_1, u_1), . . . , (s_N, u_N); each s_i is
all the text that falls below some node n_i in the HTML parse
tree; and each u_i is a URL that is associated with some
HTML anchor element a_i that appears somewhere inside n_i.
FIG. 12 shows the HTML source for a simple list and
the extracted data, and FIG. 3 illustrates a simple hotlist and
the extracted data.

Vector Space Representation for Text

The text in FIG. 11 can be understood, in part, because of
the regular appearance of the substrings that are recognized
as (fictitious) university names. These strings are recognizable
because they "look like" the names of real universities.
Implementing such heuristics requires a precise notion of
similarity for text, and one such notion is provided by the
well-known vector space model of text.

In the vector space model, a piece of text is represented
as a document vector. The vocabulary T of terms are word
stems produced by the well-known Porter stemming algorithm.
A document vector is a vector of real numbers
v ∈ R^|T|, each component of which corresponds to a term
t ∈ T. The component of v that corresponds to t ∈ T is denoted
v_t, and the TF-IDF weighting scheme is employed: for a document
vector v appearing in a collection C, v_t is zero if the
term t does not occur in the text represented by v, and otherwise
v_t = (log(TF_{v,t}) + 1) · log(IDF_t). In this formula, TF_{v,t}
is the number of times that term t occurs in the
document represented by v, and IDF_t = |C|/|C_t|, where C_t is
the set of documents in C that contain t.

In the vector space model, the similarity of two document
vectors v and w is given by the formula
SIM(v,w) = Σ_{t∈T} (v_t · w_t) / (||v|| · ||w||).
Notice that SIM(v,w) is always between zero

the following query might be used to find papers written
by editorial board members:

ed_board(X) ∧ paper(Y,Z,U) ∧ X~Y

The answer to this query would be a list of substitutions θ,
each with an associated score. Substitutions that bind X and
Y to similar documents would be scored higher. One
high-scoring substitution might bind X to "Phoebe L. Mind,
[...]" and Y to "Phoebe Mind".

A brief discussion of WHIRL is provided below.
As noted above, a more detailed description is
found in U.S. Pat. No. 6,295,533, issued on
Sep. 25, 2001, which is incorporated by reference herein.

WHIRL Semantics

A WHIRL program consists of two parts: an extensional
database (EDB), and an intensional database (IDB). The
IDB is a non-recursive set of function-free definite clauses.
The EDB is a collection of ground atomic facts, each
associated with a numeric score in the range (0, 1). In
addition to the types of literals normally allowed in a DDB,
clauses in the IDB can also contain similarity literals of the
form X~Y, where X and Y are variables. A WHIRL predicate
definition is called a view. For purposes of this discussion,
views are assumed to be flat; that is, each clause body in the
view contains only literals associated with predicates
defined in the EDB. Since WHIRL does not support
recursion, views that are not flat can be "flattened"
(unfolded) by repeated resolution.

In a conventional DDB, the answer to a conjunctive query
would be the set of ground substitutions that make the query
true. In WHIRL, the notion of provability may be replaced
with a "soft" notion of score, defined as follows. Let
θ be a ground substitution for B. If B = p(X_1, . . . , X_a)
corresponds to a predicate defined in the EDB, then
SCORE(B,θ) = s if Bθ is a fact in the EDB with score s, and
SCORE(B,θ) = 0 otherwise. If B is a similarity literal X~Y,
then SCORE(B,θ) = SIM(x,y), where x = Xθ and y = Yθ. If
B = B_1 ∧ . . . ∧ B_k is a conjunction of literals, then
SCORE(B,θ) = Π_{i=1..k} SCORE(B_i,θ). Finally, consider a WHIRL view,
defined as a set of clauses of the form A_i :- Body_i. For a
ground atom a that is an instance of one or more A_i's, the
support of a, SUPPORT(a), is defined to be the set of all
pairs <σ, Body_i> such that A_iσ = a, Body_iσ is ground, and

and one, and that similarity is large only when the two SCORE(Body„ a)— 0. The score of an atom a (for this view) 

vectors share many important (highly weighted) terms. 45 ^ defined to be 

WHIRL Logic l-n (l-SCORE(Body,, a)). 

Whirl is a logic in which the fundamental items that are 

manipulated are not atomic values, but entities that corre- (o, BODY)e SUPPORT(a) 

spond to fragments of text. Each fragment is represented This definition follows firom the usual semantics of logic 

internally as a document vector. This means that the simi- 50 programs, together with the observation that if e^ and are 

larity between any two items can be computed. In brief, independent events, then Prob(ei V e2)=l-(l-Prob(ei))(l- 

WHIRL is a non-recursive, function-free Prolog with the Prob(e2)). 

addition of a built-in similarity predicate; rather than being The operations most commonly performed in WHIRL are 

true or false, a similarity literal is associated with a real- to define and materialize views. To materialize a view, 

valued "score" between 0 and 1, and scores are combined as 55 WHIRL finds a set of ground atoms a with non-zero score 

if they were independent probabilities. sa for that view, and adds them to the EDB. Since in most 

As an example of a WHIRL query, suppose that the cases, only high-scoring answers will be of interest, the 

information extracted from the simple list of FIG. 12 is materialization operator takes two parameters: r^ an upper 

stored as a predicate ed_board(X). Suppose also that the bound on the number of answers that are generated, and e, 

information extracted from the hotlist in FIG. 12, together 60 a lower bound on the score of answers that are generated, 

with a number of similar bibliography hotlists, has been Although the procedure used for combining scores in 

stored in a predicate paper (Y, Z, U), where Y is an author WHIRL is naive, inference in WHIRL can be implemented 

name, Z is a paper title, and U is a paper URL. For instance, quite efficiently. This is particularly true if e is large or r is 

the following facts may have been extracted and stored: . small, and if certain approximations are allowed. 

ed_board(" Phoebe L. Mind, Laugh Tech"), and paper 65 The "many** construct 

(Phoebe Mind", "A linear-time version of GSAT". The structure-recognition methods described herein 

"http://.../peqnp.ps"). Using WHIRL'S similarity predicate require a recent extension to the WHIRL logic: a "soft" 
version of universal quantification. This operator is written many(Template, Test) where the Test is an ordinary conjunction of literals, and the Template is a single literal of the form $p(Y_1, \ldots, Y_k)$, where $p$ is an EDB predicate and the $Y_i$'s are all distinct; also, the $Y_i$'s may appear only in Test. The score of a "many" clause is the weighted average score of the Test conjunction on items that match the Template. More formally, for a substitution $\theta$ and a conjunction W,

$SCORE(many(p(Y_1, \ldots, Y_k), Test), \theta) = \sum_{(s, a_1, \ldots, a_k) \in P} \frac{s}{S} \cdot SCORE(Test, \theta \circ \{Y_i = a_i\}_i)$,

where $P$ is the set of all tuples $(s, a_1, \ldots, a_k)$ such that $p(a_1, \ldots, a_k)$ is a fact in the EDB with score $s$; $S$ is the sum of all such scores $s$; and $\{Y_i = a_i\}_i$ denotes the substitution $\{Y_1 = a_1, \ldots, Y_k = a_k\}$.

As an example, the following WHIRL query is a request for editorial board members that have written "many" papers on neural networks.

q(X) ← ed_board(X) ∧ many(papers(Y,Z,W), (X~Y ∧ Z~"neural networks")).

Recognizing Structure with WHIRL

a. Encoding HTML Pages and Wrappers

To encode an HTML page in WHIRL, the page is first parsed. The HTML parse tree is then represented with the following EDB predicates.

elt(Id,Tag,Text,Position) is true if Id is the identifier for a parse tree node, n, Tag is the HTML tag associated with n, Text is all of the text appearing in the subtree rooted at n, and Position is the sequence of tags encountered in traversing the path from the root to n. The value of Position is encoded as a document containing a single term $t_{pos}$, which represents the sequence, e.g., $t_{pos}$ = "html_body_ul_li".

attr(Id,AName,AValue) is true if Id is the identifier for node n, AName is the name of an HTML attribute associated with n, and AValue is the value of that attribute.

path(FromId,ToId,Tags) is true if Tags is the sequence of HTML tags encountered on the path between nodes FromId and ToId. This path includes both endpoints, and is defined if FromId=ToId.

As an example, wrappers for the pages in FIG. 12 can be written using these predicates as follows:

page1(NameAffil) ← elt(_, _, NameAffil, "html_body_table_tr_td");

page2(Title,Url) ← elt(ContextElt, _, Title, "html_body_ul_li")
∧ path(ContextElt, AnchorElt, "li_a")
∧ attr(AnchorElt, "href", Url).

Next, a discussion of appropriate encoding of "structures" is provided. Most simple lists and simple hotlists can be wrapped with some variant of either the page1 or the page2 view, in which the constant strings (e.g., "html_body_ul_li" and "li_a") are replaced with different values. Many of the remaining pages can be wrapped by views consisting of a disjunction of such clauses.

A new construct is required to formally represent the informal idea of a "structure" in a structured document: a wrapper piece. In the most general setting, a wrapper piece consists of a clause template (e.g., a generic version of page2 above), and a set of template parameters (e.g., the pair of constants "html_body_ul_li" and "li_a"). In the experiments discussed below, only two clause templates were considered, the ones suggested by the examples above, and it is assumed that the recognizer knows, for each page, if it should look for list structures or hotlist structures. In this case, the clause template need not be explicitly represented; a wrapper piece for a page2 variant can be represented simply as a pair of constants (e.g., "html_body_ul_li" and "li_a"), and a wrapper piece for a page1 variant can be represented as a single constant (e.g., "html_body_table_tr_td").

For sake of description, only the methods that recognize simple hotlist structures analogous to page2 are discussed, and it is assumed that structures are encoded by a pair of constants Path1 and Path2. However, the methods discussed herein recognize simple lists as well.

Enumerating and Ranking Wrappers

Three structure-recognition methods are discussed below. Assuming that some page of interest has been encoded in WHIRL's EDB (or in some EDB), materializing the WHIRL view possible_piece, shown in FIG. 13, will generate all wrapper pieces that would extract at least one item from the page. The extracted_by view determines which items are extracted by each wrapper piece, and hence acts as an interpreter for wrapper pieces.

Using these views in conjunction with WHIRL's soft universal quantification, one can compactly state a number of plausible recognition heuristics. One heuristic is to prefer pieces that extract many items; this trivial but useful technique is encoded in the fruitful_piece view. Recall that materializing a WHIRL view results in a set of new atoms, each with an associated score. The fruitful_piece view can thus be used to generate a ranked list of proposed "structures" by simply presenting all fruitful_piece facts to the user in decreasing order by score.

Another structure-recognition method is suggested by the observation that in most hotlists, the text associated with the anchor is a good description of the associated object. This suggests the anchorlike_piece view, which adds to the fruitful_piece view an additional "soft" requirement that the text Text1 extracted by the wrapper piece be similar to the text Text2 associated with the anchor element.

Another structure-recognition method is the R_like_piece view. This view is a copy of the fruitful_piece view in which the requirement that many items are extracted is replaced by a requirement that many "R like" items are extracted, where an item is "R like" if it is similar to some second item X that is stored in the EDB relation R. The "soft" semantics of the many construct imply that more credit is given to extracting items that match an item in R closely, and less credit is given for weaker matches. As an example, suppose that R contains a list of all accredited universities in the United States. In this case, the R_like_piece view would prefer wrapper pieces that extract many items that are similar to some known university name; this might be useful in processing pages like the one shown in FIG. 11. In experiments, the R_like_piece view provided the best results.

Maintaining A Wrapper

Maintaining wrappers is a time-consuming process. However, the invention proposes a new source of information by retaining, for each wrapper, the data that was extracted from the previous version of the page. If the format of the page has been changed, but not its content, then the previously-extracted data can be used as page-specific training examples for the new page format, and the examplelike method can be used to derive a new wrapper. If the format and the content both change, then the data extracted from the old version of the page could still be used; however, it would only be an approximation to the examples that a user would provide. Using such "approximate examples" will presumably make structure-recognition more difficult.

While particular embodiments of the invention have been shown and described, it is recognized that various modifications thereof will occur to those skilled in the art. Therefore, the scope of the herein-described invention shall be limited solely by the claims appended hereto.
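By way of non-limiting illustration, the vector space similarity and the WHIRL scoring rules described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the WHIRL implementation: the simple lowercase tokenizer stands in for Porter stemming, and every function name (tokenize, tfidf_vectors, sim, score_conjunction, score_atom, score_many) and the sample collection are hypothetical names introduced only for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens stand in for Porter stems.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(collection):
    """Return one sparse {term: weight} vector per document in the collection."""
    token_lists = [tokenize(doc) for doc in collection]
    # |C_t|: number of documents in the collection that contain term t.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(collection)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # v_t = (log(TF) + 1) * log(IDF_t), with IDF_t = |C| / |C_t|.
        vectors.append({
            t: (math.log(tf[t]) + 1.0) * math.log(n_docs / doc_freq[t])
            for t in tf
        })
    return vectors


def sim(v, w):
    """Cosine similarity of two sparse vectors; 0.0 if either norm is zero."""
    dot = sum(weight * w.get(t, 0.0) for t, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm_v * norm_w) if norm_v > 0 and norm_w > 0 else 0.0


def score_conjunction(literal_scores):
    """Score of B_1 and ... and B_k: the product of the literal scores."""
    return math.prod(literal_scores)


def score_atom(body_scores):
    """Combine supporting clause-body scores as independent probabilities."""
    return 1.0 - math.prod(1.0 - s for s in body_scores)


def score_many(facts, test):
    """Weighted average of test(args) over EDB facts given as (score, args)."""
    total = sum(s for s, _ in facts)
    if total == 0:
        return 0.0
    return sum((s / total) * test(args) for s, args in facts)


if __name__ == "__main__":
    docs = ["Phoebe L. Mind, Laugh Tech", "Phoebe Mind",
            "A linear-time version of GSAT"]
    vecs = tfidf_vectors(docs)
    print("SIM(name, author):", sim(vecs[0], vecs[1]))
    print("SIM(name, title): ", sim(vecs[0], vecs[2]))
```

In this sketch the sparse dictionary representation is chosen only for brevity; terms that occur in every document of the collection receive weight zero, as the stated IDF formula implies.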
What is claimed is: 

1. A method of extracting data from data sources in a
network, said method comprising: 

inputting a plurality of pairs of data from said network, 
each pair comprising a data source and a program 
which accurately extracts data from said data source; 

determining, for each of said pairs of data, a parse-tree for 
said data source, said parse tree comprising nodes; 

labeling, for each of said pairs of data, said nodes of said 
parse tree for said data source according to said pro- 
gram for extracting data from said data source to obtain 
labeled parse tree nodes;

generating a data set from said labeled parse tree nodes 
obtained for each of said pairs of data; and 

training a learning algorithm with said data set to learn a 
general program for labeling parse tree nodes of new 
data sources to extract data from said new data sources.

2. The method according to claim 1, further comprising 
storing said data set in a data base. 

3. The method according to claim 1, wherein said learning 
algorithm is a rule learning program for Repeated Incremental
Pruning to Produce Error Reduction (RIPPER).

4. The method according to claim 1, wherein said data 
sources and said new data sources are Web pages from the 
World Wide Web. 

5. The method according to claim 1, further comprising: 
inputting a new data source; 

determining a new parse tree for a new data source, said 

new parse tree comprising nodes; 
labeling said nodes of said new parse tree for said new 

data source according to said general program to obtain 

new labeled parse tree nodes; and 
extracting data from said new data source based upon said 

new labeled parse tree nodes. 

6. The method according to claim 5, wherein said new 
data source is a Web page from the World Wide Web. 

7. A computer-readable medium storing computer- 
executable instructions for performing a method of extract- 
ing data from data sources in a network, comprising: 

inputting a plurality of pairs of data, each pair comprising 
a data source and a program which accurately extracts 
data from said data source; 

determining, for each of said pairs of data, a parse tree for 
said data source, said parse tree comprising nodes; 

labeling, for each of said pairs of data, said nodes of said 
parse tree for said data source according to said program
for extracting data from said data source to obtain
labeled parse tree nodes; 

generating a data set from said labeled parse tree nodes 
obtained for each of said pairs of data; and 

training a learning algorithm with said data set to learn a
general program for labeling parse tree nodes of new 
data sources to extract data from said new data sources. 

8. The computer-readable medium according to claim 7, 
further comprising computer-executable instructions for 
storing said data set in a database.

9. The computer-readable medium according to claim 7, 
wherein said learning algorithm is a rule learning program 
for Repeated Incremental Pruning to Produce Error Reduc- 
tion (RIPPER). 

10. The computer-readable medium according to claim 9,
wherein said data sources and said new data sources are Web 
pages from the World Wide Web. 
11. The computer-readable medium according to claim 9,
further comprising computer-executable instructions for per- 
forming the steps of: 

inputting a new data source; 

determining a new parse tree for a new data source, said 

new parse tree comprising nodes; 
labeling said nodes of said new parse tree for said new 

data source according to said general program to obtain 

new labeled parse tree nodes; and 
extracting data from said new data source based upon said 

new labeled parse tree nodes. 

12. The computer-readable medium according to claim 
11, wherein said new data source is a Web page from the 
World Wide Web. 

13. An apparatus for extracting data from data sources in 
a network, comprising: 

means for inputting a plurality of pairs of data, each pair 
of data comprising a data source and a program which 
accurately extracts data from said data source; 

means for determining, for each of said pairs of data, a 
parse tree for said data source, said parse tree compris- 
ing nodes; 

means for labeling, for each of said pairs of data, said 
nodes of said parse tree for said data source according 
to said program for extracting data from said data 
source to obtain labeled parse tree nodes; 

means for generating a data set from said labeled parse 
tree nodes obtained for each of said pairs of data; and 

means for training a learning algorithm with said data set 
to learn a general program for labeling parse tree nodes 
of new data sources to extract data from said new data 
sources. 

14. The apparatus according to claim 13, further com- 
prising means for storing said data set in a database. 

15. The apparatus according to claim 13, wherein said
learning algorithm is a rule learning program for Repeated
Incremental Pruning to Produce Error Reduction (RIPPER).

16. The apparatus according to claim 13, wherein said 
data sources and said new data sources are Web pages from 
the World Wide Web. 

17. A method of extracting data from data sources in a 
network, said method comprising: 

inputting a plurality of pairs of data from said network, 
each pair comprising a data source and a program 
which accurately extracts data from said data source; 

determining, for each of said pairs of data, a parse tree for 
said data source, said parse tree comprising nodes; 

labeling, for each of said pairs of data, said nodes of said 
parse tree for said data source according to said pro- 
gram for extracting data from said data source to obtain 
labeled parse tree nodes; 

generating a data set from said labeled parse tree nodes 
obtained for each of said pairs of data; 

training a learning algorithm with said data set to learn a
general program for labeling parse tree nodes of new 
data sources to extract data from said new data sources; 

inputting a new data source; 

processing said new data source in accordance with said 
general program to extract data from said new data 
source; 

determining whether performance of said general program
meets a predetermined threshold based upon the
data extracted from said new data source; 
when said performance of said general program does not
meet said predetermined threshold, prompting a user to 
input a specific parse tree label for a specific node on 
said new data source; 

training said learning algorithm with said data set and said
specific parse tree label for said specific node to learn 
a specific program for labeling parse tree nodes of said 
new data source to extract data from said new data 
source; and 

repeating said prompting step and said learning step for 
said new data source until performance of said specific 
program meets said predetermined threshold. 

18. The method according to claim 17, further comprising
storing said data set in a database. 

19. The method according to claim 17, wherein said
learning algorithm is a rule learning program for Repeated
Incremental Pruning to Produce Error Reduction (RIPPER).

20. The method according to claim 17, wherein said data 
sources, said new data sources, and said new data source are 
Web pages from the World Wide Web. 

21. A computer-readable medium storing computer-executable
instructions for performing a method of extract-
ing data from data sources in a network, comprising: 

inputting a plurality of pairs of data from said network, 
each pair comprising a data source and a program 
which accurately extracts data from said data source; 

determining, for each of said pairs of data, a parse tree for 
said data source, said parse tree comprising nodes; 

labeling, for each of said pairs of data, said nodes of said 
parse tree for said data source according to said pro- 
gram for extracting data from said data source to obtain 
labeled parse tree nodes; 

generating a data set from said labeled parse tree nodes 
obtained for each of said pairs of data; 

training a learning algorithm with said data set to learn a
general program for labeling parse tree nodes of new 
data sources to extract data from said new data sources; 

inputting a new data source; 

processing said new data source in accordance with said 
general program to extract data from said new data 
source; 

determining whether performance of said general pro- 
gram meets a predetermined threshold based upon the 
data extracted from said new data source; 

when said performance of said general program does not 
meet said predetermined threshold, prompting a user to 
input a specific parse tree label for a specific node on 
said new data source; 

training said learning algorithm with said data set and said
specific parse tree label for said specific node to learn 
a specific program for labeling parse tree nodes of said 
new data source to extract data from said new data 
source; and 

repeating said prompting step and said learning step for
said new data source until performance of said specific 
program meets said predetermined threshold. 
22. The computer-readable medium according to claim
21, further comprising storing said data set in a database. 

23. The computer-readable medium according to claim 
21, wherein said learning algorithm is a rule learning pro- 
gram for Repeated Incremental Pruning to Produce Error 
Reduction (RIPPER). 

24. The computer-readable medium according to claim 
21, wherein said data sources, said new data sources, and 
said new data source are Web pages from the World Wide 
Web. 

25. An apparatus for extracting data from data sources in 
a network, comprising: 

means for inputting a plurality of pairs of data from said 
network, each pair comprising a data source and a 
program which accurately extracts data from said data 
source; 

means for determining, for each of said pairs of data, a 
parse tree for said data source, said parse tree compris- 
ing nodes; 

means for labeling, for each of said pairs of data, said 
nodes of said parse tree for said data source according 
to said program for extracting data from said data 
source to obtain labeled parse tree nodes; 

means for generating a data set from said labeled parse 
tree nodes obtained for each of said pairs of data; 

means for training a learning algorithm with said data set 
to learn a general program for labeling parse tree nodes 
of new data sources to extract data from said new data 
sources; 

means for inputting a new data source; 

means for processing said new data source in accordance 

with said general program to extract data from said new 

data source; 

means for determining whether performance of said gen- 
eral program meets a predetermined threshold based 
upon the data extracted from said new data source; 

means for prompting a user to input a specific parse tree 
label for a specific node on said new data source when 
said performance of said general program does not 
meet said predetermined threshold; 

means for training said learning algorithm with said data 
set and said specific parse tree label for said specific 
node to learn a specific program for labeling parse tree
nodes of said new data source to extract data from said 
new data source; and 

means for repeating said prompting step and said learning 
step for said new data source until performance of said 
specific program meets said predetermined threshold. 

26. The apparatus according to claim 25, further com- 
prising means for storing said data set in a database. 

27. The apparatus according to claim 25, wherein said
learning algorithm is a rule learning program for Repeated
Incremental Pruning to Produce Error Reduction (RIPPER).

28. The apparatus according to claim 25, wherein said 
data sources, said new data sources, and said new data 
source are Web pages from the World Wide Web. 